;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;Design.text

;;;--------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov        ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor                          ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035            ;;;
;;;                                                                          ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science.     ;;;
;;; All rights reserved. The RIACS Software Policy contains specific         ;;;
;;; terms and conditions on the use of this software, and must be            ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This        ;;;
;;; copyright and notice must be preserved in all copies made of this file.  ;;;
;;;--------------------------------------------------------------------------;;;

This file gives an intermediate level description of the AutoClass-3
program design. It was originally generated as an informal system
specification and development notebook.

Objectives:

 1 - A modular implementation of the likelihood function that allows easy
     extension of the set of attribute interactions that can be described.
     This contrasts with AutoClass 2 in which the single interaction type
     (conditional independence) was built into the code, and the only
     alternative was to ignore some attributes.

 2 - User specification of the current likelihood function, or class
     model, at run time. This is envisioned as being a product of
     conditionally independent likelihood terms over selected data
     attributes. The user chooses the type of interaction, for specific
     sets of attributes, from a range of prespecified interaction types.
     Note that this probably pays too little attention to the problem of
     searching the model space. In the block covariant normal there are
     many combinations of blocks possible.

 3 - Minimize runtime by compiling the class likelihood function into a
     single function over the class parameters and a datum's attributes.
     This has been extended to a variety of class specific functions.

 4 - For developmental flexibility it is desired to:
     a - Be able to maintain multiple classifications.
     b - Be able to maintain multiple models.

 5 - To support future research it is desired to:
     a - Define classifications that contain multiple class likelihood
         models.
     b - Define hierarchical models capable of runtime optimization.

Constraints:

 1 - The range of the calculated probabilities forces us to perform our
     calculations in terms of logarithms. Probability normalization may
     then result in underflow, and produce nominal zeros.

Terminology:

I will speak of probability and likelihood distribution functions
interchangeably. They are in fact exactly the same function form, used to
calculate the probability of data with respect to fixed parameters or the
likelihood of parameters with respect to a fixed set of data.

Implementation:

A classification is defined in terms of the probability model(s) and
classes which instantiate the model(s). The classification is made with
respect to some particular database. A classification is implemented as a
classification-$ structure (short name is clsf-$) which contains several
parameters, a database pointer, a vector of model pointers, and an
adjustable fill-pointer vector of class pointers. See the files
..>prog>struct-model.lisp, ..>prog>struct-clsf.lisp, and
..>prog>struct-class.lisp.

Models, classifications, and classes are implemented in an object oriented
manner using Common Lisp structures with supplementary functions having
the same prefix as the corresponding structure accessors.

A probability model is defined in terms of attribute interactions for a
particular database. The model partitions the attributes into subsets
(type att-set-$) whose members interact according to a particular
probability function term. Within the model, the subsets are assumed to be
conditionally independent (given the class).
Thus the inter-set probabilities are multiplicatively combined and the
model specifies a probability function term for each att-set. The model
also defines the parameter structure of instance classes, provides the
priors, and holds the names of the runtime compiled model dependent
functions (log-likelihood, update-xxxx, etc.) used for generic operations
on its classes. A user specifies the model in an xxx.model file that is
interpreted by functions in the ..>prog>i/o-read-model file. The runtime
definition of model/database specific functions is carried out by
invocation of the expand-model-terms function and the expand-xxx-fn
functions in file ..>prog>model-expander.

The probability function terms specify how the element(s) of an attribute
subset interact to produce a probability. Addition of a new probability
term type requires specification of priors, the parameter structures, and
a set of functions for the likelihood term, the class statistics based
likelihood and marginal term approximations, and certain auxiliary terms.
Runtime model expansion will then produce a set of model specific class
functions which call the appropriate probability terms with precompiled
arguments. See the files ..>prog>model-xxx.lisp, particularly
..>prog>model-expander. The currently defined probability terms are
single-multinomial, single-normal-cn (for Constant observation error, No
missing values), and single-normal-cm (for Constant observation error,
Missing values present).

A class is an instance of a likelihood model within a classification. It
consists of:
 a. Structures instantiating the parameter set (and auxiliary variables)
    of the model.
 b. Functions to reference the model dependent functions.
 c. A vector of the class weights (probabilities) for each datum.
 d. Various class specific parameters.
See file ...>prog>struct-class.lisp.

A database contains descriptive information and a vector of data instance
vectors of attribute values.
The information consists of the source file name(s), the number of
instances (n-data), the number of attributes (n-atts), and an 'att-info
vector. There is a positional correspondence between the 'att-info vector
and the data instance vectors. The elements of 'att-info give the data
type (one of *att-types*), a documentation string, and type dependent
range information (see real-range-$ and disc-range-$). Two databases are
considered equivalent, with respect to a particular model, if all
referenced attributes have the same type and missing value, and discrete
types have the same range. The database input functions are in file
i/o-read-data.lisp. Databases are stored on a file pair: an xxx.db2 file
which contains the number of data and object vectors, and an xxx.hd2 file
containing the object descriptions.

Files:

All files of the development system are in Autoclass:.... The program
files are in Autoclass:program;, data files in Autoclass:data; and
experimental results in Autoclass:results;. The program is implemented as
a system under the name of autoclass with short name of ac in package ac.
The system declaration file is ...>prog>sys-dcl.lisp. Generally those
files forming a system module will have names with a common prefix.

Operation:

See file ...>usage.text for current information on preparing, running,
and interpreting AutoClass.
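
Illustrative sketch:

The following is a minimal sketch, in Python rather than the Common Lisp of
the actual program, of two ideas described under Objectives, Constraints,
and Implementation: building a single class log-likelihood function from
conditionally independent terms, and normalizing class weights in log
space. It is not AutoClass code. All function names are hypothetical, the
normal term ignores observation error and missing values, and each term is
assumed to cover a single attribute.

```python
import math


def log_normal(x, mean, sigma):
    """Log density of a normal distribution at x (hypothetical term)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma * sigma)
            - (x - mean) ** 2 / (2.0 * sigma * sigma))


def log_multinomial(value, probs):
    """Log probability of one discrete value (hypothetical term)."""
    return math.log(probs[value])


def compile_class_log_likelihood(terms):
    """Combine terms into one function of a datum.

    terms is a list of (term_fn, attribute_index, params) triples.
    Because the terms are conditionally independent given the class,
    their probabilities multiply, so their logarithms add.
    """
    def log_likelihood(datum):
        return sum(fn(datum[i], *params) for fn, i, params in terms)
    return log_likelihood


def normalize_log_weights(log_weights):
    """Convert per-class log weights for one datum into probabilities.

    Subtracting the largest log weight before exponentiating keeps the
    dominant class representable. Very unlikely classes may still
    underflow to 0.0.
    """
    peak = max(log_weights)
    scaled = [math.exp(lw - peak) for lw in log_weights]
    total = sum(scaled)
    return [s / total for s in scaled]


# One class over a datum with a real attribute and a discrete attribute.
class_ll = compile_class_log_likelihood([
    (log_normal, 0, (0.0, 1.0)),
    (log_multinomial, 1, ({"a": 0.25, "b": 0.75},)),
])
print(class_ll((0.3, "a")))
print(normalize_log_weights([-1000.0, -1001.0]))
```

Exponentiating the log weights -1000 and -1001 directly gives zero for
both, so normalizing them naively would divide by zero. Shifting by the
maximum first avoids this for the leading class.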